Raft's Safety, Fault-Tolerance, and Availability Protocols
Let's learn how Raft ensures safety, handles leader, follower, and candidate crashes, and maintains availability.
Safety#
The previous lessons discussed how Raft elects leaders and replicates log entries. Still, additional mechanisms are needed to guarantee that every state machine executes the same commands in the same order. To see why, consider a follower that misses several log entries while the leader commits them. Such a follower could become the new leader and overwrite the committed entries with new ones, causing different state machines to execute different sequences of commands. The following slides show such a scenario:
(Slides omitted: three slides illustrating a follower that missed committed entries becoming leader and overwriting them.)
To address this issue, the Raft algorithm restricts which servers can be elected as leaders to ensure that the leader for any given term contains all the entries committed in previous terms.
Election restriction#
In leader-based consensus algorithms, the leader is responsible for eventually storing all committed log entries. However, some algorithms allow a leader to be elected without initially having all committed entries. These algorithms require additional mechanisms to identify and transmit missing entries to the new leader during or after the election process, leading to increased complexity.
The following table lists the two election restrictions set by Raft, along with their rationale.

| Restriction | Rationale |
| --- | --- |
| Raft ensures that each new leader already contains all entries committed in previous terms from the moment of its election, so none need to be transferred to it afterward. | Log entries therefore flow in only one direction, from leaders to followers, and a leader never overwrites existing entries in its own log. |
| Raft prevents a candidate from winning an election unless its log contains all committed entries. It enforces this restriction through the voting process: a voter denies its vote to any candidate whose log is less up-to-date than the voter's own log. | To become leader, a candidate must contact a majority of the cluster, and any majority of the cluster contains at least one node that stores the latest committed entries. The `RequestVote` RPC enforces this restriction by including information about the candidate's log. |
Point to ponder
Question
What could be Raft’s criteria for the log to be considered up-to-date?
Raft compares the index and term of the last entries in two logs to determine which log is more up-to-date. If the last entries have different terms, the log whose last entry has the later term is more up-to-date. If the logs end with the same term, the longer log is more up-to-date.
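As a rough sketch (not taken from any particular Raft implementation), the up-to-date comparison a voter performs can be written as a small function; the parameter names are illustrative:

```python
def candidate_log_up_to_date(cand_last_term, cand_last_index,
                             voter_last_term, voter_last_index):
    """Return True if the candidate's log is at least as up-to-date
    as the voter's log, per Raft's comparison rule."""
    if cand_last_term != voter_last_term:
        # Different last terms: the later term wins, regardless of length.
        return cand_last_term > voter_last_term
    # Same last term: the longer (or equal-length) log is up-to-date.
    return cand_last_index >= voter_last_index
```

A voter would grant its vote only when this function returns `True` (and it has not already voted in the current term).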
Committing entries from previous terms#
A Raft leader knows that an entry from its current term is committed once that entry is stored on a majority of the servers. If a leader crashes before committing an entry, future leaders will attempt to finish replicating it. However, a new leader cannot immediately conclude that an entry from a previous term is committed just because it is stored on a majority of the servers.
Before discussing Raft’s approach to committing log entries from previous terms, let’s examine how an old log entry stored on a majority of servers could potentially be overwritten by a future leader. This hypothetical scenario is illustrated below:
(Slides omitted: five slides illustrating an entry from an old term, stored on a majority of servers, being overwritten by a future leader.)
Rule for committing entries from previous terms: To counter the issue illustrated above, Raft never commits log entries from previous terms by counting replicas. Only log entries from the leader's current term are committed by counting replicas. Once an entry from the current term has been committed in this way, all preceding entries are committed indirectly because of the Log Matching Property, which guarantees that logs are consistent up to that point across servers.
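This rule can be sketched as a commit-index advancement routine. The sketch below is illustrative, not a reference implementation: it assumes the leader keeps its log as a list of `(term, command)` pairs and tracks, per server (including itself), the highest log index known to be replicated there:

```python
def advance_commit_index(log, match_index, commit_index, current_term):
    """Advance commitIndex, counting replicas only for entries
    from the leader's current term."""
    n_servers = len(match_index)  # includes the leader itself
    for n in range(len(log), commit_index, -1):  # 1-based entry index n
        term_of_n = log[n - 1][0]
        if term_of_n != current_term:
            # Never commit an older-term entry directly by counting replicas.
            continue
        replicas = sum(1 for m in match_index.values() if m >= n)
        if replicas * 2 > n_servers:  # strict majority
            return n  # entries up to n are committed (Log Matching Property)
    return commit_index
```

Note how an entry from term 1 that is stored on a majority still does not advance the commit index of a term-2 leader until some term-2 entry also reaches a majority.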
Point to ponder
Question
Could you think of a situation where it can be safely assumed that an older log entry is committed by just counting replicas?
There might be circumstances where a leader can safely deduce that an older log entry is committed, such as when it is stored on every server (the number of replicas equals the number of servers). However, Raft follows a more conservative approach for simplicity.
Rationale: Some other consensus algorithms require a new leader to replicate entries from previous terms under a new term number. In contrast, Raft keeps the original term number on a log entry even when a new leader replicates it. This makes it easier to reason about log entries, since they keep the same term number over time and across logs. In addition, new leaders in Raft transmit fewer log entries from previous terms than algorithms that must renumber and retransmit redundant entries before committing them.
Safety argument#
The argument that Raft ensures safety rests on the Leader Completeness Property: if a log entry is committed in a given term, then the leaders of all later terms will have that entry in their logs.
We can demonstrate that the Leader Completeness Property holds in the complete Raft algorithm through a proof by contradiction. Suppose the property does not hold: the leader of term T (leaderT) commits a log entry from its term, but the leader of some future term U (leaderU) does not store that entry, where U is the smallest term greater than T whose leader does not store it. Then:
- leaderU must not have had the committed entry in its log at the time of its election, since leaders never delete or overwrite entries in their own logs.
- leaderT replicated the committed entry on a majority of the cluster, and leaderU received votes from a majority of the cluster. Therefore, at least one server (the voter) both accepted the entry from leaderT and voted for leaderU. For example, if S5 is chosen as leader for a later term after S1 (the leader of term T) commits a new log entry from its term, then at least one server (S3) must have both accepted that entry and voted for S5.
- The voter accepted the committed entry from leaderT before voting for leaderU.
- The voter still had the entry stored in its log when it voted for leaderU, because every intervening leader contained the entry (U is the smallest term whose leader lacks it).
- The voter granted its vote to leaderU, so leaderU's log must have been at least as up-to-date as the voter's log. This leads to one of two contradictions.
- First, if the voter and leaderU had the same last log term, then leaderU's log must have been at least as long as the voter's, so it contained every entry in the voter's log. This is a contradiction, because the voter had the committed entry in its log, while leaderU was assumed not to.
- Otherwise, leaderU's last log term must have been larger than the voter's last log term. Moreover, it was larger than T, since the voter's last log term was at least T (the voter stored the committed entry from term T). The earlier leader that created leaderU's last log entry must have contained the committed entry in its own log, by our assumption, since that leader's term is greater than T and less than U. Then, by the Log Matching Property, leaderU's log must also contain the committed entry, which is again a contradiction.
- This completes the contradiction. Therefore, the leaders of all terms greater than T must contain all entries from term T that are committed in term T.
- The Log Matching Property further guarantees that future leaders will also contain entries that are committed indirectly.

In conclusion, assuming that the Leader Completeness Property does not hold (that a committed entry from term T is not stored by the leader of some later term) leads to a contradiction, so the property must hold.
Follower and candidate crashes#
Up to this point, we have focused on handling leader failures. Follower and candidate crashes are much simpler to handle than leader crashes, and Raft deals with both in the same way. If a follower or candidate crashes, future RequestVote and AppendEntries RPCs sent to it will fail. Raft handles these failures by retrying indefinitely; once the crashed server restarts, the RPC will succeed. If a server crashes after executing an RPC but before responding, it will receive the same RPC again after it restarts. Since Raft’s RPCs are idempotent, the repeated RPC causes no harm. For instance, if a follower receives an AppendEntries request containing log entries already present in its log, it ignores those entries in the new request.
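The idempotent handling of a repeated AppendEntries request can be sketched as follows. This is a simplified illustration (term checks and commit-index updates are omitted) that assumes the follower's log is a list of `(term, command)` pairs and `prev_index` is the 1-based index of the entry preceding the new ones:

```python
def append_entries(log, prev_index, entries):
    """Apply AppendEntries idempotently: skip entries already stored,
    truncate on a term conflict, then append what is missing."""
    for i, entry in enumerate(entries):
        idx = prev_index + i + 1           # 1-based index of this entry
        if idx <= len(log):
            if log[idx - 1][0] == entry[0]:
                continue                   # already stored: ignore duplicate
            del log[idx - 1:]              # conflict: truncate follower's log
        log.append(entry)
    return log
```

Because entries that already match are skipped, delivering the same RPC twice leaves the log unchanged the second time.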
Timing and availability#
For Raft to be safe, the system must not produce incorrect results because of timing. However, availability, the system’s ability to respond to clients in a timely fashion, inevitably depends on timing. For instance, if message exchanges take longer than the typical time between server failures, candidates will not stay up long enough to win an election. Without a stable leader, the system cannot make progress.
Leader election is a critical aspect of Raft that relies heavily on timing. To elect and maintain a steady leader, the system must meet the following timing requirement:

broadcastTime ≪ electionTimeout ≪ MTBF

Here:
- The average time a server takes to send RPCs in parallel to every other server in the cluster and receive their responses is called broadcastTime.
- The election timeout, as described in Raft’s Leader Election Protocol, is known as the electionTimeout.
- MTBF refers to the average time between failures for a single server.
The broadcast time must be an order of magnitude less than the election timeout so that leaders can reliably send the heartbeat messages that prevent followers from starting elections. This inequality also makes split votes unlikely, given the randomized election timeouts. To ensure the system progresses steadily, the election timeout must in turn be a few orders of magnitude less than the MTBF. When the leader crashes, the system remains unavailable for roughly the election timeout, which should represent only a small fraction of the overall time.
While the broadcast time and MTBF are properties of the underlying system, the election timeout is a variable that we need to choose appropriately. Raft’s RPCs typically require the recipient to store information in stable storage, so depending upon the storage technology, the broadcast time may range from 0.5ms to 20ms. Consequently, the election timeout will likely be between 10ms and 500ms. The typical server MTBFs are several months or more, comfortably fulfilling the timing requirement.
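These constraints can be made concrete with a small sketch. The factor-of-10 margins below mirror "an order of magnitude" and the 150-300 ms timeout range is one commonly used illustrative choice, not a value mandated by Raft:

```python
import random

def timing_ok(broadcast_ms, election_timeout_ms, mtbf_ms, margin=10):
    """Check broadcastTime << electionTimeout << MTBF, where '<<' is
    approximated as 'at least `margin` times smaller'."""
    return (broadcast_ms * margin <= election_timeout_ms
            and election_timeout_ms * margin <= mtbf_ms)

def random_election_timeout(base_ms=150, spread_ms=150):
    # Randomizing the timeout (e.g., 150-300 ms) reduces split votes.
    return base_ms + random.uniform(0, spread_ms)
```

For example, with a 10 ms broadcast time, a 200 ms election timeout, and an MTBF of months, the requirement holds comfortably; a 50 ms broadcast time with the same 200 ms timeout would violate the first inequality.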
In the next lesson, we’ll discuss how Raft implements cluster membership changes in its algorithm.